833 research outputs found

    Faster Algorithms for the Constrained k-Means Problem

    The classical center based clustering problems such as k-means/median/center assume that the optimal clusters satisfy the locality property that the points in the same cluster are close to each other. A number of clustering problems arise in machine learning where the optimal clusters do not follow such a locality property. For instance, consider the r-gather clustering problem where there is an additional constraint that each of the clusters should have at least r points or the capacitated clustering problem where there is an upper bound on the cluster sizes. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters O_1, ..., O_k are an arbitrary partition of the dataset and the goal is to output k-centers c_1, ..., c_k such that the objective function sum_{i=1}^{k} sum_{x in O_{i}} ||x - c_{i}||^2 is minimized. It is not difficult to argue that any algorithm (without knowing the optimal clusters) that outputs a single set of k centers, will not behave well as far as optimizing the above objective function is concerned. However, this does not rule out the existence of algorithms that output a list of such k centers such that at least one of these k centers behaves well. Given an error parameter epsilon > 0, let l denote the size of the smallest list of k-centers such that at least one of the k-centers gives a (1+epsilon) approximation w.r.t. the objective function above. In this paper, we show an upper bound on l by giving a randomized algorithm that outputs a list of 2^{~O(k/epsilon)} k-centers. We also give a closely matching lower bound of 2^{~Omega(k/sqrt{epsilon})}. Moreover, our algorithm runs in time O(n * d * 2^{~O(k/epsilon)}). This is a significant improvement over the previous result of Ding and Xu who gave an algorithm with running time O(n * d * (log{n})^{k} * 2^{poly(k/epsilon)}) and output a list of size O((log{n})^k * 2^{poly(k/epsilon)}). 
Our techniques generalize to the k-median problem and to many other settings where non-Euclidean distance measures are involved.
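
The objective described in this abstract can be illustrated with a small sketch. It assumes the partition O_1, ..., O_k is given explicitly, which is precisely what an algorithm does not know in advance; the function names `sq_dist`, `constrained_cost`, and `centroid` are hypothetical and not taken from the paper. It also uses the standard fact that, for a fixed cluster, the centroid minimizes the sum of squared distances.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance ||x - c||^2."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def constrained_cost(clusters: List[List[Point]], centers: List[Point]) -> float:
    """sum_{i=1}^{k} sum_{x in O_i} ||x - c_i||^2 for a GIVEN partition.

    clusters[i] plays the role of O_i and centers[i] the role of c_i.
    Points are charged to the center of their own cluster, not to the
    nearest center, so the partition may violate locality.
    """
    return sum(
        sq_dist(x, c)
        for cluster, c in zip(clusters, centers)
        for x in cluster
    )


def centroid(cluster: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty cluster."""
    n = len(cluster)
    return [sum(coords) / n for coords in zip(*cluster)]
```

For example, a partition that groups (0, 0) with (10, 0) and leaves (2, 0) alone is a valid input here even though it is not what nearest-center assignment would produce.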

    Noisy, Greedy and Not so Greedy k-Means++

    Hardness of Approximation for Euclidean k-Median

    The Euclidean k-median problem is defined as follows: given a set X of n points in d-dimensional Euclidean space R^d and an integer k, find a set C of k points in R^d (called centers) such that the cost function Phi(C, X) := sum_{x in X} min_{c in C} ||x - c||_2 is minimized. The Euclidean k-means problem is defined similarly by replacing the distance with the squared Euclidean distance in the cost function. Various hardness of approximation results are known for the Euclidean k-means problem [Pranjal Awasthi et al., 2015; Euiwoong Lee et al., 2017; Vincent Cohen-Addad and Karthik C. S., 2019]. However, no hardness of approximation result was known for the Euclidean k-median problem. In this work, assuming the unique games conjecture (UGC), we provide a hardness of approximation result for the Euclidean k-median problem in O(log k) dimensional space. This solves an open question posed explicitly in the work of Awasthi et al. [Pranjal Awasthi et al., 2015]. Furthermore, we study the hardness of approximation for the Euclidean k-means/k-median problems in the bi-criteria setting, where an algorithm is allowed to choose more than k centers. That is, bi-criteria approximation algorithms are allowed to output beta * k centers (for a constant beta > 1), and the approximation ratio is computed with respect to the optimal k-means/k-median cost. We show a hardness of bi-criteria approximation result for the Euclidean k-median problem for any beta < 1.015, assuming UGC. We also show a similar hardness of bi-criteria approximation result for the Euclidean k-means problem with a stronger bound of beta < 1.28, again assuming UGC.
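
The two cost functions defined in this abstract can be written down directly. The sketch below is only an evaluation routine for a candidate set of centers, not an approximation algorithm; the names `kmedian_cost` and `kmeans_cost` are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from typing import Sequence

Point = Sequence[float]


def kmedian_cost(centers: Sequence[Point], points: Sequence[Point]) -> float:
    """sum over x of min over c of ||x - c||_2 (Euclidean k-median cost)."""
    return sum(min(math.dist(x, c) for c in centers) for x in points)


def kmeans_cost(centers: Sequence[Point], points: Sequence[Point]) -> float:
    """Same as above with squared distances (Euclidean k-means cost)."""
    return sum(min(math.dist(x, c) ** 2 for c in centers) for x in points)
```

In the bi-criteria setting, the same functions would be evaluated on a center set of size larger than k and compared against the optimal cost with exactly k centers.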

    Calcia-doped yttria-stabilized zirconia for thermal barrier coatings: synthesis and characterization

    Doping with other oxides has been a stabilization method of ZrO2 for thermal barrier coating applications. One such stabilized system is 7-8 mol% YO1.5-doped zirconia (7YSZ), which has been in use for around 20 years. In this study, calcia (CaO) and yttria (Y2O3) have been used for doping ZrO2 to produce a stable single-phase cubic calcia-doped yttria-stabilized zirconia (CaYSZ). This has been synthesized using wet chemical synthesis as well as solid-state synthesis. Unlike partially stabilized zirconia, where 5 mol% CaO is doped into ZrO2, CaYSZ has been found to be stable up to 1600 °C. Detailed CaYSZ synthesis steps and phase characterization are presented. Wet chemical synthesis resulted in a stable single-phase CaYSZ after just 4 h of treatment at 1400 °C, whereas 36 h of annealing at 1600 °C is required for CaYSZ synthesis during solid-state processing. The CaYSZ has been found to be stable even for 600 h at 1250 °C. The coefficient of thermal expansion and sintering temperature of CaYSZ were found to be 11×10^-6 K^-1 and 1220 °C, respectively, which are comparable to those of 7YSZ. An increase in sintering rate with increasing dopant concentration has also been observed.

    Triangle Estimation Using Tripartite Independent Set Queries

    Estimating the number of triangles in a graph is one of the most fundamental problems in sublinear algorithms. In this work, we provide an approximate triangle counting algorithm using only polylogarithmic queries when the number of triangles on any edge in the graph is polylogarithmically bounded. Our query oracle, Tripartite Independent Set (TIS), takes three disjoint sets of vertices A, B and C as input, and answers whether there exists a triangle having one endpoint in each of these three sets. Our query model belongs to the general class of group queries (Ron and Tsur, ACM ToCT, 2016; Dell and Lapinskas, STOC 2018) and in particular is inspired by the Bipartite Independent Set (BIS) query oracle of Beame et al. (ITCS 2018). We extend the algorithmic framework of Beame et al., with TIS replacing BIS, for triangle counting using ideas from color coding due to Alon et al. (J. ACM, 1995) and a concentration inequality for sums of random variables with bounded dependency (Janson, Rand. Struct. Alg., 2004).
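
The semantics of a TIS query, as described in this abstract, can be simulated on an explicitly stored graph. In the query model the oracle is a black box and the algorithm never sees the edges; the brute-force routine below is only a reference simulation, and the name `tis_query` and the adjacency-dictionary representation are assumptions made for illustration.

```python
from typing import Dict, Set


def tis_query(adj: Dict[int, Set[int]], A: Set[int], B: Set[int], C: Set[int]) -> bool:
    """Return True iff some triangle has one vertex in each of A, B and C.

    adj maps each vertex to the set of its neighbours (undirected graph).
    A, B and C must be pairwise disjoint, as the oracle definition requires.
    """
    if (A & B) or (A & C) or (B & C):
        raise ValueError("A, B and C must be pairwise disjoint")
    for a in A:
        neighbours_a = adj.get(a, set())
        # Candidate b must be adjacent to a and lie in B.
        for b in B & neighbours_a:
            # A common neighbour of a and b inside C closes a triangle.
            if C & neighbours_a & adj.get(b, set()):
                return True
    return False
```

On a graph with the triangle 1-2-3 and a pendant edge 3-4, the query ({1}, {2}, {3}) answers yes, while ({1}, {2}, {4}) answers no.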

    Approximate Clustering with Same-Cluster Queries

    Ashtiani et al. proposed a Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to make adaptive queries to a domain expert. The queries are of the kind "do two given points belong to the same optimal cluster?", where the answers to these queries are assumed to be consistent with a unique optimal solution. There are many clustering contexts where such same-cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the k-means clustering problem, with an additional margin assumption, can be solved efficiently if one is allowed to make O(k^2 log{k} + k log{n}) same-cluster queries. This is interesting since the k-means problem, even with the margin assumption, is NP-hard. In this paper, we extend the work of Ashtiani et al. to the approximation setting by showing that a few such same-cluster queries enable one to get a polynomial-time (1+eps)-approximation algorithm for the k-means problem without any margin assumption on the input dataset. Again, this is interesting since the k-means problem is NP-hard to approximate within a factor (1+c) for a fixed constant 0 < c < 1. The number of same-cluster queries used by the algorithm is poly(k/eps), which is independent of the size n of the dataset. Our algorithm is based on the D^2-sampling technique, also known as the k-means++ seeding algorithm. We also give a conditional lower bound on the number of same-cluster queries, showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make Omega(k/poly log k) same-cluster queries. Our algorithm can be extended to the case where the query answers are wrong with some bounded probability. Another result we show for the k-means++ seeding is that a small modification of the k-means++ seeding within the SSAC framework converts it to a constant factor approximation algorithm instead of the well-known O(log k)-approximation algorithm.
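
The D^2-sampling technique mentioned in this abstract is the standard k-means++ seeding: the first center is chosen uniformly at random, and each later center is chosen with probability proportional to its squared distance from the nearest center picked so far. The sketch below shows only this plain seeding step, without same-cluster queries and without the modification the abstract refers to; the function name `d2_sample_centers` and its parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def d2_sample_centers(
    points: Sequence[Point], k: int, rng: Optional[random.Random] = None
) -> List[Point]:
    """Pick k centers from points by D^2-sampling (k-means++ seeding)."""
    rng = rng or random.Random()

    def sq_dist(x: Point, c: Point) -> float:
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

    # First center: uniform over the dataset.
    centers = [rng.choice(points)]
    # d2[j] = squared distance from points[j] to its nearest chosen center.
    d2 = [sq_dist(x, centers[0]) for x in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Sample proportionally to squared distance to the nearest center.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(old, sq_dist(x, nxt)) for old, x in zip(d2, points)]
    return centers
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when comparing seedings.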

    Machine learning-based Naive Bayes approach for divulgence of Spam Comment in Youtube station

    In the 21st century, web-based media plays an indispensable part in interaction and communication. Platforms such as YouTube, Facebook, and Twitter can raise the social standing of an individual as well as of a group. Yet every innovation has its pros and cons. On some YouTube channels, machine-generated spam comments are posted on videos, and some fake users also post spam comments, which adversely affects those channels. Spam comments can be detected using artificial intelligence (AI) based on algorithms such as Naive Bayes, SVM, Random Forest, and ANN. The present investigation focuses on a machine learning-based Naive Bayes classification methodology for identifying spam comments on YouTube.
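
A Naive Bayes comment classifier of the general kind named in this abstract can be sketched in a few lines. This is a generic multinomial Naive Bayes with add-one (Laplace) smoothing and whitespace tokenization, not the authors' pipeline, which the abstract does not describe; the function names `train_nb` and `predict_nb` and the labels "spam" and "ham" are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Model = Tuple[Dict[str, Counter], Counter, Set[str]]


def train_nb(comments: List[str], labels: List[str]) -> Model:
    """Count words per label and comments per label."""
    word_counts: Dict[str, Counter] = {}
    doc_counts = Counter(labels)
    for text, label in zip(comments, labels):
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def predict_nb(model: Model, text: str) -> str:
    """Return the label with the highest log-posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = "", -math.inf
    for label, counts in word_counts.items():
        # Log prior from the fraction of training comments with this label.
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training on a handful of made-up comments and calling `predict_nb` on a new one is enough to exercise the sketch; a realistic evaluation would need a labeled comment dataset and a held-out test split.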